DOCUMENT RESUME

ED 397 635                                        FL 023 728

AUTHOR        Strong, Gregory
TITLE         A Survey of Issues and Item Writing in Language
              Testing.
PUB DATE      Dec 95
NOTE          34p.
PUB TYPE      Reports - Descriptive (141) -- Journal Articles (080)
JOURNAL CIT   Thought Currents in English Literature; v68 p281-312
              Dec 1995

EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Cloze Procedure; Difficulty Level; *English (Second
              Language); Foreign Countries; Interrater Reliability;
              *Language Tests; Multiple Choice Tests; Reading
              Tests; Second Language Instruction; Statistical
              Analysis; Test Construction; Testing; *Test Items;
              *Test Reliability; *Test Validity; Writing
              (Composition); *Writing Evaluation
IDENTIFIERS   Japan; Test of English as a Foreign Language; Test of
              English for International Communication



ABSTRACT 

This paper traces developments in educational 
psychology and measurement that led to the Test of English as a 
Foreign Language (TOEFL) and the Test of English for International 
Communication (TOEIC) and the application of educational measurement 
terms such as validity and reliability to testing. Use of a table of 
specifications for planning language tests and procedures for obtaining 
greater inter- and intra-rater reliability on composition tests are 
noted. An overview is given of recent criticisms of language testing 
in Japan. Considerations for item writing are examined briefly, and 
the major item types in language testing are described and discussed, 
including multiple choice (in reading and listening), matching 
(reading), a variety of cloze procedures (reading and grammar), 
scanning (reading), paraphrasing (reading and grammar), information 
transfer (reading and writing), editing, guided paragraph writing, 
paragraphs and essays, question and response (listening and 
speaking), paraphrase (listening), short talks (listening), and using 
a table (listening). Some simple statistical procedures for 
determining item difficulty and for discriminating top-scoring and 
low-scoring students are also offered. Contains 30 references. 

(MSE) 



Reproductions supplied by EDRS are the best that can be made
from the original document.






A Survey of Issues and Item Writing 
in Language Testing 






Gregory Strong 






Thought Currents in English Literature 
Volume LXVIII, December 1995 

The English Literary Society of 
Aoyama Gakuin University 






Introduction 

This paper traces developments that led to the TOEFL and 
TOEIC and the application of educational measurement terms such 
as validity and reliability to testing. The use of a table of specifica- 
tions in planning a language test is discussed as are procedures for 
obtaining greater inter-rater and intra-rater reliabilities in composi- 
tion tests by such means as using a holistic marking scale, sample 
papers, and rater training. An overview of some recent criticisms of 
language testing in Japan is presented as well as a review of multiple 
choice items, four types of cloze reading tests, and other examples of 
potential questions for reading, writing, and listening tests. Some 
simple statistical procedures for determining the difficulty of a ques- 
tion and discriminating between top scoring and low scoring stu- 
dents are outlined. 

I. Language Tests 

English language testing has improved steadily with the introduc- 
tion of new tests, refinements in test administration, and analyses 
and critiques of tests and of particular types of questions or 
test items. Another factor has been the shift in language teaching 
methodology. The movement has been away from the classical gram- 
mar-translation method and the audio-lingual approach emphasizing 
listening comprehension to one advocating a communicative class- 
room methodology. In this approach, teachers try to engage students 
in classroom activities where they use the language to actually com- 
municate meaningful information instead of engaging in translation, 
linguistic analysis, or in repetition and drill activities. 

The first tests of English as a foreign language were developed 
in Britain. These included the Certificate of 
Proficiency in English (CPE) in 1913 and the First Certificate En- 
glish Test (FCE) in 1939, both introduced by Cambridge University. The 
university had already begun developing national exams in Britain in 
1858 (Bachman, Davidson, Ryan, & Choi, 1995, pp. 2, 3). Early tests 
like the CPE and the FCE required language students to write com- 
positions, to translate passages, and to take dictations. The tests and 
their results were disseminated throughout the British Common- 
wealth and helped establish standards for many educational programs 
around the world. Since then, the CPE and FCE have been im- 
proved considerably. Cambridge tests of ability at other 
levels have also been introduced: the Preliminary English Test (PET) and 
the Key English Test (KET) at the lower levels, and the Certificate 
of Advanced English (CAE) at the level of proficiency between the 
FCE and the CPE. 

The other major development in English language testing came 
from the United States where language testing began later than in 
Britain. It started in 1930 in response to rapid increases in the num- 
ber of student immigrants. The first tests were composed of reading 
passages with true and false questions, a short composition, a dic- 
tation, an oral test, and a 250-300 word composition (Bachman et al., 
Ibid., pp. 3, 4). 

New ideas in educational psychology, measurement and testing led 
to the articulation of a rational process of curriculum design (Tyler, 
1949), and the creation of a taxonomy of educational objectives by 
researchers such as Bloom (1956). In turn, psychometric measures 
and linguistic principles were applied to language testing. Tests em- 
ployed multiple choice items and concentrated on specific lexical and 
structural points, a focus later to become known as “discrete-point” 
testing. Educational Testing Service (ETS) of Princeton, New Jersey, 
which had been established in 1948, created the Test of English as a 
Foreign Language (TOEFL) in 1963 to help American universities 
place foreign students into their programs. In 1991, some 741,000 
students wrote the test and more than 2,400 universities in Canada 
and the U.S. used its scores to place students, making it the single 
most influential language test in the world (Pierce, 1992, p. 665). 

Meanwhile, in Japan, requests from the Japanese Ministry of In- 
ternational Trade and Industry in the 1970s led to ETS developing 
the Test of English for International Communication (TOEIC) for 
the Japanese market. Subsequently, this test was used by Japanese 
corporations in assessing the English language abilities of their em- 
ployees. Although both the TOEIC and the TOEFL were designed 
by ETS, the TOEFL uses reading passages and situations found in 
academic discourse, and the TOEIC employs the language and vo- 
cabulary of business English and of commonplace situations. 

The STEP test, or Eiken, is another test that was introduced in 
1963. Devised entirely within Japan by the Society of Testing English 
Proficiency, it is the most commonly taken test in Japan next to the 
TOEFL. Over 40,000,000 students have taken it over the last 32 
years (Bostwick, 1995, p. 58). Unlike the TOEFL and TOEIC, which are 
norm-referenced tests comparing students with one another, the STEP 
test consists of six levels of achievement, and students either pass or 
fail at the level they elect to take. In this way, the STEP tests, like 
the Cambridge proficiency tests, are criterion-referenced. 

II. Content Validity 

Central to improvements in testing have been the two concepts of 
“content validity” and “test reliability.” In short, these are consider- 
ations of whether or not a test measures what its designers planned, 
and to what degree the results of a test would be the same if it were 
administered again to the same group of students. 

A university entrance exam has content validity if it is made up of 
questions related to the activities and teaching materials used in 
courses at the university. Of course, not all of the aspects of a pro- 
gram can be covered in a single entrance test. However, representa- 
tive materials and skills should be part of the test. A test with good 
content validity will more accurately assess students' abilities in rela- 
tion to their future studies and place them more appropriately than 
otherwise. 

II. (a) Table of Specifications 

Exam specifications are published with each of the major tests 
discussed earlier, and also by many universities in the U.S., Britain, 
and Europe that have entrance examinations. These specifications 
are of great use to students preparing for an exam and help reduce 
the possibility that they will get high scores by accident instead of by 
adequate preparation. 

Even more useful is a Table of Specifications (Hughes, 1989; 
Shohamy, 1985) indicating which skills are to be included on a test 
and by which kinds of test items these skills are to be measured. It is 
of great assistance in planning tests and in writing test items and 
discussing them. 

TABLE OF SPECIFICATIONS

                                        OBJECTIVES
CONTENTS              Identifying    Paragraph   Using the    Comprehension   Total
                      the Main Idea  Cohesion    Sentence
                                                 Context
Multiple Choice            10                                                   10
Matching                                 5                                       5
Cloze                                                10                         10
Open-ended Question                                                 5            5
Total                                                                           30

(Shohamy, 1985, p. 10)

In the table, the grid relates a skill or type of knowledge to a 
question. The table illustrates the specifications for a 30-point read- 
ing test. The examinee is being tested for skills in finding the main 
idea in a reading passage, for reconstructing a narrative, for using the 
sentence context, and for comprehending the key elements in a read- 
ing passage. Each of these skills is cross-referenced with the types of 
questions that will be used to assess it: multiple choice items, match- 
ing, cloze, and an open-ended question with a written response. Ide- 
ally, there should be variety in both the skills being tested and in the 
item types on the test in order to assess a broad range of student 
abilities. The relative weight of each skill should reflect its impor- 
tance in the language program. 
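
As an illustration only, the 30-point reading test described above could be recorded as a small data structure so that the points per item type and the relative weight of each skill can be checked while planning. This is a minimal sketch; the names SPEC, totals, and weights are hypothetical and do not come from Shohamy (1985) or Hughes (1989).

```python
# Hypothetical encoding of the 30-point reading test described in the text.
# Keys are (item type, skill); values are points. Only the filled cells
# of the table are listed.
SPEC = {
    ("multiple choice", "identifying the main idea"): 10,
    ("matching", "paragraph cohesion"): 5,
    ("cloze", "using the sentence context"): 10,
    ("open-ended question", "comprehension"): 5,
}


def totals(spec, axis):
    """Sum points by item type (axis=0) or by skill (axis=1)."""
    out = {}
    for key, points in spec.items():
        out[key[axis]] = out.get(key[axis], 0) + points
    return out


def weights(spec):
    """Share of the whole test carried by each skill."""
    total = sum(spec.values())
    return {skill: pts / total for skill, pts in totals(spec, 1).items()}


if __name__ == "__main__":
    print("Total points:", sum(SPEC.values()))
    for skill, share in weights(SPEC).items():
        print(f"{skill}: {share:.0%}")
```

A planner could compare the printed shares against the importance of each skill in the language program before any items are written.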

II. (b) Other Types of Validity 

The validity of a test may be measured in a simple, non-statistical 
way after the students have entered the program. At the end of the 
term, one could compare the students’ classroom results with their 
scores on the tests. One would expect the highest scoring students to 
do better than the other students in their classes. If this were the 
case, then the test would have a high level of content validity because 
it had predictive validity in indicating students’ future scores. 

The test would have “concurrent” or “criterion validity” if its re- 
sults were similar to another test measuring the same skills. It would 
be expected that two different tests of reading comprehension would 



E rjc^EST COPY AVAILABLE 



7 



286 



Gregory Strong 



show a high degree of correlation between students’ marks. The 
high-scoring students on one test should do well on the second test. 
However, students’ results on a reading test would not likely corre- 
late with their score on a writing test, or on a test of another skill. 
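
The comparisons described above, whether between a placement test and later classroom results or between two tests of the same skill, are usually expressed as a correlation coefficient. The following is a minimal sketch of the Pearson correlation; the function name and the score lists are hypothetical and are invented purely for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented scores for five students (not data from the paper).
    placement_test = [12, 18, 25, 9, 21]
    end_of_term = [55, 70, 88, 50, 76]
    print(round(pearson(placement_test, end_of_term), 2))
```

A coefficient near 1 would support the claim that high scorers on one measure also score high on the other; a coefficient near 0 would not.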

III. Reliability 

Reliability in testing comes from the concern that there will be 
inconsistencies in test results and that there always will be a margin 
of error in reporting test scores. There are five basic types of reliabil- 
ity, several of which can be tested statistically. 

The first is test-retest reliability. This is a hypothetical question about the 
degree of correlation between test scores if students took the same 
test twice. In every test, there would be differences in scores due to 
chance, and perhaps due to error in administering the test. The 
correlation between the two sets of scores is the degree of reliability 
of the test’s administration. Administering the test twice would have 
many obvious drawbacks, not the least being that students would 
recall many of the questions from taking the test initially. 

A statistical procedure has been developed to determine this kind 
of reliability. This is the split halves method. Each examinee is given 
two scores, one for the even numbered questions on the test and one 
for the odd-numbered questions. The correlation between the scores 
on these two half-tests, or the split halves of the test, is calculated 
and then adjusted with the Spearman-Brown formula to estimate the 
reliability of the full-length test. 
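
The split halves procedure can be sketched as follows, assuming each examinee's answers are stored as a list of item scores (1 for correct, 0 for incorrect). The half-test correlation r is stepped up with the Spearman-Brown formula, 2r / (1 + r). The function names and the data layout are assumptions made for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per examinee, in test order."""
    # Questions 1, 3, 5, ... sit at indexes 0, 2, 4, ...
    odd_scores = [sum(row[0::2]) for row in responses]
    even_scores = [sum(row[1::2]) for row in responses]
    return spearman_brown(pearson(odd_scores, even_scores))
```

For example, a half-test correlation of 0.5 corresponds to an estimated full-test reliability of about 0.67.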

The second type of reliability is that of parallel forms which is the 
extent to which any two forms of the same test measure the same 
skills or traits. The third is the internal consistency of a test, the 
degree to which test questions are related to one another and mea- 
sure the same skills or traits. The internal consistency of a test can be 
measured by first calculating the mean and standard deviation of the 
test scores and then using the Kuder-Richardson 21 statistical measure 
of reliability. 
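
A minimal sketch of the Kuder-Richardson 21 estimate follows. It uses the usual textbook form, (k / (k - 1)) * (1 - M(k - M) / (k * s^2)), where k is the number of items, M is the mean total score, and s^2 is the variance of the total scores. The choice of the sample variance here is an assumption; some presentations use the population variance instead.

```python
from statistics import mean, variance


def kr21(total_scores, n_items):
    """Kuder-Richardson 21 reliability from total scores on a test
    of n_items dichotomously scored (right/wrong) items."""
    if n_items < 2:
        raise ValueError("the test needs at least two items")
    m = mean(total_scores)
    var = variance(total_scores)  # sample variance (an assumption)
    if var == 0:
        raise ValueError("KR-21 is undefined when all scores are equal")
    return (n_items / (n_items - 1)) * (1 - (m * (n_items - m)) / (n_items * var))
```

KR-21 assumes that all items are of roughly equal difficulty, so it is best read as a rough estimate rather than a precise figure.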


III. (a) Inter-Rater and Intra-Rater Reliabilities 

The other two kinds of reliability on a test are “inter-rater reliabil- 
ity” and “intra-rater reliability.” These two types of reliability refer to 
the scoring done on subjective tests. These are open-ended questions 
requiring written answers, usually paragraph or essay questions. In- 
ter-rater reliability is the degree to which two different raters or 
markers agree on a score for a student paper. Intra-rater reliability is 
the extent to which one rater or marker scores consistently from one 
student’s paper to another. 

In terms of these latter two types of reliability, there is overwhelm- 
ing evidence that the scoring of writing is very unreliable unless 
certain procedures are followed. These procedures include (1) setting 
the scoring criteria in advance, (2) providing sample answers for the 
markers, (3) training the markers to use the criteria, and (4) scoring each 
paper twice, and a third time if there is too much difference in the 
scores attributed to the same paper. These procedures are well-estab- 
lished in the field of English composition research (Braddock, Lloyd- 
Jones, & Schoer, 1963; Cooper, 1977; Diederich, 1974; Myers, 1980). 

To demonstrate the unreliability of marking papers unsystemati- 
cally, none of these procedures were used in an experiment in the 
MA TESOL program in the Testing and Evaluation Unit at Reading 
University, England. In this well-known experiment, twenty-two MA 
students scored eight papers between 1 and 20 points (Weir, 1993, p. 
155). 

The table lists the 22 raters in the left-hand column. At the 
bottom of the table are the range of scores assigned to each paper 
and the mean score for each paper. In the right-hand columns are the 
mean score given to the papers by each rater and the range of scores 
each rater gave to the eight papers. 

                          PAPER NUMBERS
RATERS       1     2     3     4     5     6     7     8   mean   range
A            8    12    12    13    15     8    14    16     12    8-16
B            7    11    12    13    14     7    14    15     12    7-15
C            5    12    11     9     9     4    11     9      9    4-12
D            9    10    14    14    14     6    16    19     13    6-19
E            9    15    15    11    14     8    16    16     13    8-16
F            7    10    11    12    13    14    15    12     12    7-15
G            4    10    15     5    12     3    18    19     11    4-19
H            7    11    10     8    12     6    17    11     11    6-17
I           12    14    17    10    19    10    17    17     15   10-19
J            5     2     3     2     5     1    18     5      5    1-18
K            8    12    14     5    10    13     6    10     11    6-15
L            8     9    11    11    13     9    15    15     11    8-15
M            5    12    15     8    15     9    16    14     12    5-16
N            4    10    12    12    15     3    18    20     12    4-20
O            7    10    10    10    12    15    16    18     12    7-18
P            4     7    12     9    10     3    14    17     10    4-17
Q            5     7    10     8     9     3    11    13      8    3-13
R            3     8     9     9     7     4    17    15      9    3-17
S            8    10    15    10    12     8    15    15     12    8-15
T            3     3     5     5     6     2     8    14      5    2-10
U           12    14    16    13    12     3    19    18     13    3-18
V           10    14    17    14    13     8    18    18     14    8-18
range     3-12  2-15  3-17  2-15  5-19  1-15  6-18  5-20
mean         7    11    12    10    12     7    15    15

(Weir, 1993, p. 155)

It can be seen from the table that there is a large range of scores 
assigned to any one paper. Paper #8 was given a low score of 5 
points and a top score of 20. The mean score for this paper is 15 
points, suggesting that it is a good, passing paper because so many 
raters gave it a high score. However, rater J, scoring it at 5 points, fails 
it. Even the smallest range of marks for a paper is considerable. 
Paper #1 has a mean score of 7 points and is likely a poorly written 
paper. It was given a low score of 3 and a top score of 12, which not 
only passes the paper, but is a higher score than the score some 
raters gave Paper #8. It can be seen that there is little inter-rater 
reliability between the markers. 
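
The figures quoted for Papers 1 and 8 can be checked with a few lines of code. The sketch below is only an illustration: the two lists are the 22 raters' scores (A to V) for those papers as transcribed from the table (Weir, 1993, p. 155), and the function name summarize is hypothetical.

```python
# Scores given by raters A to V, transcribed from the table.
PAPER_1 = [8, 7, 5, 9, 9, 7, 4, 7, 12, 5, 8,
           8, 5, 4, 7, 4, 5, 3, 8, 3, 12, 10]
PAPER_8 = [16, 15, 9, 19, 16, 12, 19, 11, 17, 5, 10,
           15, 14, 20, 18, 17, 13, 15, 15, 14, 18, 18]


def summarize(scores):
    """Return (lowest, highest, mean rounded to a whole point)."""
    return min(scores), max(scores), round(sum(scores) / len(scores))


if __name__ == "__main__":
    print("Paper 1:", summarize(PAPER_1))
    print("Paper 8:", summarize(PAPER_8))
    # The best mark for the weak paper exceeds the worst mark for the
    # strong one, which is the overlap the text describes.
    print(max(PAPER_1) > min(PAPER_8))
```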

This experiment indicates that even well-educated, experienced 
markers with expertise in EFL such as these graduate students will 
score papers inaccurately without adequate criteria and rating proce- 
dures. 

Although this example does not demonstrate the problem of intra- 
rater reliability from one paper to another, this has been well-estab- 
lished in the research literature in composition in a first language. 
Coffman and Kurfman (1968) show that marking behaviour in a 
single rater changes over the marking period. This also is well-estab- 
lished by others (Braddock, Lloyd-Jones, & Schoer, 1963; Cooper, 
1977; Diederich, 1974; Myers, 1980). 

III. (a) (i) A Holistic Scale 

These researchers (Ibid.) suggest the use of a holistic or general 
impression marking scale for scoring papers. The markers form a 
“holistic” or general impression of each paper’s content, organiza- 
tion, sentence structure or style, and its written expression or use of 
grammar. The scales used are commonly five-point, six-point, or 
twelve-point scales. The smaller the range of scores on a scale, the 
greater the reliability in marking. This is because it is more likely that 
two raters will assign the same score to a paper if they are using a 
five-point scale than a twelve-point one. Afterward, the students’ 
marks for that portion of the test can be scaled to represent a larger 
portion of their exam marks than five or twelve points. 
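
The rescaling step mentioned above can be sketched as a simple proportion. The text does not say how the marks are scaled, so linear scaling is an assumption here, and the function name and the 30-point weighting in the test of the function are hypothetical.

```python
def scale_band(band, max_band, weight):
    """Convert a holistic band score into exam points by simple
    proportion (an assumed method, not one stated in the paper)."""
    if not 1 <= band <= max_band:
        raise ValueError("band is outside the scale")
    return band * weight / max_band
```

Under this assumption, a paper rated 4 on a six-point scale would earn 20 of the 30 points allotted to a writing section.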

One of the better known scales currently in use is the one devel- 
oped by ETS for use with the Test of Written English (TWE). The 
TWE was developed to meet the need for an essay test in some 
university admission requirements. 

This six-point scale was modified by Strong (1990) and subse- 
quently employed by the English Department of Aoyama University 
in the writing assessment portion of the placement test of the Inte- 
grated English program in 1995. There are six bands on the scale. 
There is a description of the content, organizational patterns, the use 
of paragraph transitions, and effective sentence structure, and gram- 
mar for each band. The bands are as follows: (6) Advanced student 
writer, (5) Good student writer, (4) Competent student writer, (3) 
Modest student writer, (2) Marginal student writer, and (1) Limited stu- 
dent writer. The descriptors for each band outline the general 
features of a paper at that level. 

To properly train markers in using the scale, an outline of a com- 
plete answer at band 6 is devised. Then a committee selects a series 
of papers randomly and chooses among them for six anchor papers 
that the committee feels demonstrate the writing competencies at 
each of the different bands on the scale. Afterward, the raters exam- 
ine the six anchor papers and try to determine where each fits on the 
scale. The raters discuss their reasons for assigning their marks, and 
then they compare their results with those of the committee. 

Raters are asked to mark on general impressions and to avoid 
deducting points for individual grammatical errors such as spelling 
mistakes, or instances of incorrect subject-verb agreement, or any 
lack of topic sentences. The raters are to ask themselves if a paper 
that may be written by an Advanced student writer is thoughtful, 
well-organized, and has only minor errors, or if the paper seems to 
be written by a less advanced writer and fits elsewhere on the scale. 

A head rater works with small groups of raters, randomly checking 
each rater’s marked papers to determine if the rater has been using 
the scale correctly. Each paper is marked twice. If there is more than 
one point difference between the scores on a paper, then it is scored 
by a third rater, and usually the three scores are averaged. Once 
teachers are trained in using the scale, marking proceeds quickly and 
accurately with only a few minutes spent on each paper. There are no 
comments or corrections made on any of the papers. 

6  Advanced student writer
   - logical and persuasive argument
   - well-organized paragraphs
   - thoughtful ideas, names, details
   - appropriate transition words
   - minor errors in grammar and punctuation
   - interesting word choice

5  Good student writer
   - argument is clear although obvious
   - an organized paragraph
   - suitable examples
   - few transitions and less varied sentences
   - errors in grammar and punctuation don't interfere
     with communication

4  Competent student writer
   - an argument is apparent
   - one or two developed examples
   - simple transitions
   - grammatical errors sometimes interfere with
     communication

3  Modest student writer
   - badly organized paragraph
   - underdeveloped examples
   - repetitive word choice
   - minor and major errors in grammar
   - repetitive sentence structure

2  Marginal student writer
   - question answered very superficially
   - at times seems incoherent
   - underdeveloped paragraph
   - flawed sentence structure
   - very limited word choice

1  Limited student writer
   - inability to comprehend the question
   - severely underdeveloped paragraph
   - obscured meaning in the sentences
   - persistent major grammatical errors
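
The double-marking rule described above (two ratings per paper, a third rating when the first two differ by more than one point, and the three scores averaged) can be sketched as a short function. The text does not say how two close ratings are combined, so averaging them is an assumption, and the function and parameter names are hypothetical.

```python
def final_score(first, second, third=None, tolerance=1):
    """Combine holistic ratings for one paper.

    If the two ratings differ by no more than `tolerance`, they are
    averaged (an assumption; the paper does not specify this case).
    Otherwise a third rating is required and all three are averaged.
    """
    if abs(first - second) <= tolerance:
        return (first + second) / 2
    if third is None:
        raise ValueError("ratings differ too much; a third rating is needed")
    return (first + second + third) / 3
```

For instance, ratings of 4 and 5 would stand as 4.5, while ratings of 2 and 5 would send the paper to a third rater.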

IV. Toward Improvements in Validity and Reliability 

Aside from the statistical analyses to determine test validity, and 
the measures to improve reliability suggested earlier, there are a num- 
ber of steps that can be taken to ensure better testing. Hughes (1989) 
outlines these: 

1. Plan the test systematically. 

2. Include a variety of item types on a test to assess a broad 
range of language skills. 

3. Identify candidates by number, not name. 

4. Do not allow candidates choices of questions as this makes it 
harder to compare candidates. 

5. Write test items with clear expectations. 

6. Provide good instructions, possibly in the candidates’ native 
language. 

7. Ensure that tests are well laid out and completely legible. 

8. Familiarize candidates with the format and testing techniques 
in advance, and provide sample questions. 

9. Provide testing conditions that are uniform and not distract- 
ing to the participants. 

10. Use items that encourage unambiguous scoring where pos- 
sible. 

11. Provide a detailed scoring key specifying acceptable answers, 
and noting the points to be assigned for partially correct an- 
swers. 

12. Train raters where the scoring is subjective. 

13. In the case of subjective items such as open-ended questions, 
and extended writing, agree on the appropriate answers and 
scores before marking the tests. Use sample papers and train- 
ing sessions for these questions. 

14. Where testing is subjective, especially in paragraph and essay 
tests, use two raters. 

(pp. 36-42) 







In describing the benefits of language testing, Brown (1995) notes 
that tests can be used to sort students according to their language 
abilities and create more homogeneous classes which will be easier to 
teach. Brown maintains that tests should be adapted from existing 
tests, or developed exclusively for an institution in order to select the 
students most suitable for its programs. 

In this area, Japanese colleges and universities deserve consider- 
able recognition for developing tests that are unique to each institu- 
tion. Furthermore, the tests themselves are created cooperatively in 
exam committees and there is discussion and criticism of test items. 
These features of language testing in Japan are very positive ones. 

However, Brown (1995) and other researchers (Bostwick, 1995; 
Brown and Yamashita, 1995) have several criticisms of examinations 
in Japan. Brown and Yamashita (Ibid.) analyzed the entrance exams 
at 21 private Japanese universities including Aoyama, Keio, Rikkyo, 
Sophia, and Waseda, and 10 public universities, among them, Kyoto, 
Osaka and Tokyo universities. The sources for their study were two 
commercially available books, Koko-Eigo Kenkyu (1993), ’93 Shiritsu 
Daigaku-ben: Eigo Mondai no Tetteiteki Kenkyu, Tokyo, Kenkyusha and 
Koko-Eigo Kenkyu, ’93 Kokukoritsu Daigaku-ben: Eigomondai no 
Tetteiteki Kenkyu, Tokyo, Kenkyusha. 

Brown and Yamashita (1995) based their analysis on exam item 
types, and the comparative difficulty of reading passages on exams. 
They used a computer spreadsheet program to code and count the 
types of questions on the different university exams. Afterward, they 
used Right Writer Intelligent Grammar Checker, version 4.0 (Que 
Software, Sarasota, Florida, 1990), to analyze features in the 
reading passages on the exams. This software program calculates the 
number of words, the syllables per word, the number of words per 
sentence, and the number of sentences. It also determines the read- 
ability of passages using the Flesch, Flesch-Kincaid, and Fog read- 
ability indexes. 
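
For readers unfamiliar with these indexes, the sketch below gives the commonly published forms of the three formulas, computed from counts of words, sentences, syllables, and "complex" words (three or more syllables). Whether the software used in the study implemented exactly these forms is an assumption, and the function names are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level, as a U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex_words counts words of 3+ syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))
```

All three indexes rest on sentence length and word length alone, so they say nothing about topic familiarity or text organization.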








Among the observations they made were that there were substan- 
tial differences between the reading sections of the exams. The pub- 
lic universities tended to have more reading passages, but of shorter 
length. The reading difficulty of the passages ranged from those 
appropriate for native speakers at sixth grade, in the case of Kansai 
University, to those suitable for third year university students in the 
case of the entrance exam at Nagoya University (Ibid., p. 89). As for 
item variety, Kansai and Sophia universities placed a heavy emphasis 
on multiple choice items while other universities such as Kyoto em- 
phasized translation (Ibid., p. 91). Furthermore, only four universi- 
ties, Aoyama University and Tokyo University among them, included 
listening items on their exams. This was despite recent Monbusho 
guidelines advocating more listening and speaking activities in En- 
glish instruction in Japanese junior and senior high schools (Ibid.). 

The researchers made several additional observations. New sets of 
directions had to be given often in exams. Test lengths also varied 
considerably. They suggested that students taking these exams 
would be confronted by too much variation in language testing and 
that this situation might discriminate in favour of students who were 
more test-wise rather than those who were better at using English. 
The researchers also suggested that translation activities, besides be- 
ing hard to grade, might be too difficult a skill to require of students 
with only limited English study in junior and senior high school. 

Finally, Brown and Yamashita (1995) criticize the universities in 
their study because none of them performs any of the statistical analyses 
of reliability and validity of their language tests that are common prac- 
tice elsewhere. They suggest that Japanese universities either follow 
the guidelines in Standards for Educational and Psychological Testing 
(Committee to Develop Standards for Educational and Psychological 
Testing, 1985; Washington, D.C.: American Psychological Association) 
or adapt these to Japan (Brown & Yamashita, Ibid., p. 98). 







Bostwick (1995) makes a similar criticism of the Eiken STEP and 
the Jido Eiken STEP tests. He argues that although they are profi- 
ciency tests, there is no information available on their validity and 
reliability. There is no explanation of how levels or passing scores are 
calculated. As a result, it is not possible to learn whether the tests 
successfully distinguish between several levels of language perfor- 
mance and whether these levels are consistent from test administra- 
tion one year to the next. 

V. Item Writing: Reading, Writing, Listening 

Under the impact of the communicative language teaching meth- 
odology, language test items are changing. Test items in the past were 
almost exclusively of the discrete-point type where specific language 
points such as vocabulary items, and verb conjugations were tested. 
But now tests include language tasks where students complete activi- 
ties that include several different language skills and may be based on 
real-life activities such as reading signs, and brochures, following 
directions, note-taking, and writing different kinds of compositions 
such as paraphrases, summaries, and statements of opinion 
(Shohamy, 1985). 

In addition, many test items used to be based on indirect measures 
of language ability such as a knowledge of grammar being used to 
test a student’s writing ability. These items are being replaced by 
more direct measures of writing such as requiring students to write 
compositions. 

Productive language skills such as speaking and writing also are 
being tested more extensively than before. Both skills are measured 
in such contemporary tests as the Cambridge series of language pro- 
ficiency tests mentioned earlier, and the CAN test, a Canadian-devel- 
oped test of language skills created at the University of Ottawa. The 
same is true of the TOEFL which has introduced two additional 
tests, the Test of Written English (TWE), and the Test of Speaking 




296 



Gregory Strong 



English (TSE). Furthermore, ETS is planning a major revision ot 
TOEFL exams in the TOEFL Year 2000 project to change the ex- 
amination into a more communicative, task-oriented one (Brown & 
Yamashita, 1995). 

In general, several considerations apply when designing test items
in reading and listening. According to Shohamy (1985) and Hughes
(1989), these are (1) including different types of reading texts or
listening materials; (2) using authentic texts and real-life tasks
wherever possible; and (3) not attempting to ask too many questions
about a single reading passage or listening text. The use of a broad
range of subjects and types of questions provides each examinee with
what Hughes (Ibid.) calls “fresh starts” and taps different language
abilities.

The remainder of this paper will outline some of the major types 
of items in language tests. These are items used in assessing reading, 
writing, and listening skills and do not include oral interviews and 
tests. 

V. (a) Multiple Choice (Reading and Listening)

The continued attraction of multiple choice test items lies in their
unambiguous answers, the comparative ease with which they are
scored, and their statistical reliability. They are now used for a
variety of question types testing reading and listening skills, but
their original purpose was to assess terminology, facts,
classifications, and other discrete areas of knowledge (Gronlund,
1977).

A multiple choice item consists of a stem, a correct answer, and
three or four alternatives or distractors. As far as possible, the stem
should be written in simple, clear language, and most of the wording
of the question should be in the stem. The item difficulty is
controlled by varying the problem in the stem or by changing the
alternatives. The answer and the distractors must be grammatically
consistent with the stem and parallel in length and grammatical
structure, and the distractors must all be plausible answers to the
uninformed (Ibid., p. 45, 49).

The problem with multiple choice items is that they encourage
guessing: a student could score as high as 33% just by chance
(Hughes, 1989). Although guessing is a factor in other test items, its
effect there is much smaller. Hughes (Ibid.) also notes that the
multiple choice format restricts what may be tested and that it is
very difficult to write plausible alternatives to the correct answer.
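
The arithmetic behind the chance score can be sketched briefly. The
figure of 33% corresponds to blind guessing on items with three
options; with four or five options the expected chance score is
lower. The function name chance_score is hypothetical, and the sketch
assumes that every option is equally likely to be picked, which real
examinees rarely do.

```python
from fractions import Fraction


def chance_score(num_options: int) -> Fraction:
    """Expected proportion correct when every item is answered by blind guessing.

    Assumes one correct option per item and equally likely choices.
    """
    if num_options < 2:
        raise ValueError("an item needs at least two options")
    return Fraction(1, num_options)


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(f"{k} options: {float(chance_score(k)):.0%} expected by chance")
    # 3 options: 33% expected by chance
    # 4 options: 25% expected by chance
    # 5 options: 20% expected by chance
```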

However, Heaton (1988) contends that these test items can still be
effective in discriminating between students, especially if they are
pre-tested on a representative sample of the test population. The
latter precaution will help in gauging the difficulty of the test and can
be used to compare a test with those of previous years. Heaton
(Ibid.) counters the criticism of guessing with the observation that
examinees rarely make wild guesses, but usually base their choices on
partial knowledge of a question anyway (p. 28). Heaton recommends
four distractors for grammar questions, and five for vocabulary and
reading questions (Ibid.).

V. (b) Matching (Reading)

Typically, the matching question is a modification of the multiple
choice form in which all the stems, or premises, are listed in one
column on the left and a longer list of distractors, called responses,
is listed in a column on the right (Gronlund, 1977). In matching
questions, the lists should be short, and each response should be a
plausible alternative for all of the premises. The factor of guessing
is reduced in this type of question because there are so many possible
answers.

One of the better known applications of matching test items is as
a test of vocabulary and the use of context clues. Given a list of
words at the end of a passage, students are asked to find synonyms in
the passage itself. A detailed context is supplied by the passage,
making this an economical method of testing vocabulary.






group       band

owned

specific    particular

THE TEHUELCHES 

The Tehuelches lived in a band — usually of between fifty and a hundred 
people. Each band had exclusive rights to a particular hunting area... 

(Heaton, 1988, p. 60) 



Another application for matching questions is in a test of reading 
comprehension. The examinees have to select the appropriate phrase 
in order to create a cohesive expository passage. They are given 
several sentences at the beginning of the passage and at the end as 
well. 



IN SEARCH OF LANGUAGE’S MISSING LINK 

American linguists believe they are approaching their
profession’s ultimate goal - the reconstruction of the ’mother tongue’,
the language spoken by earliest humanity. The ancient words of the first
human beings are about to be heard again, they say...

...The human race was at that time just a loose band of people inhabiting
a region of sub-Saharan Africa. (1) ______ replacing neanderthals and
other rivals, bearing our language round the world.

As humanity spread out, this mother tongue divided into various
dialects which in turn developed into new languages. (2) ______ leading
to the development of modern mankind’s many different languages
ranging from Aborigine to Eskimo, from Serbo-Croat to Basque...

A. Then we emerged, out of Africa, to take
over the world

B. This process was repeated over the
centuries

(Cambridge, 1991, CAE Paper 3, p. 6)



In a similar type of question item, students may be asked to find 
the appropriate sentences to create a cohesive narrative text. Both 
types of test items require that the difficulty of the reading passage be 
appropriate for the students being tested and that the responses be 
thoroughly pre-tested, preferably on a sample group of students. 





RARITY 

As we threaded our way down and down, one of our party stopped and
knelt. There, hanging downwards on a dead branch on the forest floor,
was what looked like a large, dried and blackened flower, two petals
partially open. I was about to move on when the petals quivered weakly
and a bright, unwinking eye gazed at me from the flower’s base.

I gently cupped my hand around the swift and lifted it, wet and shivering
minutely. Obviously it had been knocked out of the sky by the recent
storm. Falling drenched and helpless into the forest, the bird had tried
to regain its habitat by climbing the branch - a brave but hopeless
attempt...

A

In that instant the argument between the scientist and the conservationist in
me was decided.

B

Abruptly the image reversed itself, as illusions do, and it became a
bird... one of the swifts - incongruous, the most aerial of all birds, stranded
deep in the forest. It was as strange as finding a whale. How had it
reached this nadir?

(Cambridge, 1991, CAE, Paper 1, p. 4, 5)



V. (c) Cloze Tests (Reading and Grammar) 

In cloze tests, the examinees are given a passage in which
words have been replaced by blanks and they have to decide which
word best fits each blank. The more skilled the language learner, the
better he or she will be at choosing the best word for each
blank. When a reading appropriate to the level of the students is
chosen, this test has a high degree of reliability.

One of the advantages of cloze tests is that an open-ended cloze
test (as opposed to a multiple choice cloze test with distractors) is
easy to construct. It can also be used as an effective substitute for
grammar tests because students are given an actual language sample
and are presented with a full range of structural questions, from verb
choice and use of tense, prepositions, and articles to questions of
semantics and rhetoric.

Cloze passages are usually constructed by leaving the first few
sentences of a reading intact. Several sentences are also left intact
at the end of the passage in order to provide a complete context.

Often, there is a total of 30-50 blanks to be completed
(Ikeguchi, 1995, p. 168). Cloze passages are of four types: fixed
rate, rational deletion, multiple choice, and c-test.

V. (c) i Fixed Rate Cloze 

In this kind of cloze test, the test items are created by deleting
words at regular intervals: every fifth word or, more commonly, every
seventh word. The more frequently words are deleted, the more
difficult the passage becomes (Brown, 1983, 1988).
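
A fixed rate deletion can be sketched in a few lines of code. The
following is only an illustration under stated assumptions: the
function name fixed_rate_cloze and its parameters are hypothetical,
words are taken to be whitespace-separated, and the caller is
expected to pass in only the middle of the passage, keeping the
opening and closing sentences intact separately.

```python
import re


def fixed_rate_cloze(body: str, rate: int = 7):
    """Blank out every `rate`-th word of `body`.

    Returns the gapped text and the answer key. Punctuation attached to
    a deleted word is kept in place so the sentence structure stays visible.
    """
    if rate < 2:
        raise ValueError("rate must be at least 2")
    gapped, answers = [], []
    for position, token in enumerate(body.split(), start=1):
        if position % rate == 0:
            # Separate leading/trailing punctuation from the word itself.
            lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", token).groups()
            if core:
                answers.append(core)
                gapped.append(f"{lead}({len(answers)}) ______{trail}")
                continue
        gapped.append(token)
    return " ".join(gapped), answers


if __name__ == "__main__":
    sample = ("The examinee reads the passage and supplies each missing "
              "word from the surrounding context")
    text, key = fixed_rate_cloze(sample, rate=7)
    print(text)
    print(key)  # ['supplies', 'context']
```

Lowering the rate parameter from 7 to 5 produces more blanks in the
same passage, which is how the difficulty is adjusted in this kind of
test.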

V. (c) ii Rational-Deletion Cloze

In this type of cloze test, different types of words are deleted to
test different aspects of the examinee’s knowledge of English. To
find the answers to the test items, candidates must look within
the clause where the blank appears, within the sentence, or within
the paragraph itself. In this manner, the test measures students’
abilities to read at semantic, syntactic, and paragraph levels of
comprehension.



Water, soil and the earth’s green mantle of plants make up the world
that supports the animal life of the earth. Although modern man seldom
remembers the fact, he could not exist without the plants that harness
the sun’s energy and manufacture the basic foodstuffs he depends
upon for life. Our attitude toward ______ plants is a singularly
narrow ______ (3). If we see any immediate utility in ______ (4) plant we
foster it...

(Hughes, 1989, p. 66) 
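
A rational deletion can be approximated in code by blanking words
from a list chosen by the test writer. This is a simplified sketch,
not a description of how any published test was built: the function
name rational_cloze is hypothetical, and the sketch deletes every
occurrence of the target words, whereas a test writer would normally
pick individual positions by hand.

```python
import re


def rational_cloze(body: str, targets):
    """Blank out every word in `body` that appears in `targets`.

    `targets` is the test writer's own list, for example articles and
    prepositions when the aim is to test grammatical knowledge.
    Returns the gapped text and the answer key.
    """
    wanted = {t.lower() for t in targets}
    answers = []

    def blank(match):
        word = match.group(0)
        if word.lower() in wanted:
            answers.append(word)
            return f"({len(answers)}) ______"
        return word

    # Treat runs of letters and apostrophes as words; leave the rest as is.
    gapped = re.sub(r"[A-Za-z']+", blank, body)
    return gapped, answers


if __name__ == "__main__":
    sentence = "If we see any immediate utility in a plant we foster it."
    text, key = rational_cloze(sentence, {"in", "a"})
    print(text)  # If we see any immediate utility (1) ______ (2) ______ plant we foster it.
    print(key)   # ['in', 'a']
```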

V. (c) iii Multiple Choice Cloze 

In construction, this type of cloze test is either of the fixed rate
deletion or the rational deletion type. It has the same advantages and
disadvantages as multiple choice items. Because the choice of
potential answers is supplied, students finish the test far more
quickly than if they were doing an open-ended cloze test.

But the multiple choice cloze test is far more difficult to create
than an open-ended test because the distractors must be written as
well. This usually requires pre-testing to select suitable distractors. A
quick, effective way to create these distractors is to give the test as an
open-ended cloze test to a sample population and use their responses
as the basis for test items. Alternatively, one might write distractors
that all use the same part of speech as the correct answer. Either
method has been found to have a high degree of reliability (Ikeguchi,
1995).
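
The first method, drawing distractors from the responses of a sample
population, can be sketched as a simple tally. The function name
distractors_from_responses and the sample data are hypothetical; the
sketch assumes that the most frequent wrong answers to a blank make
the most plausible distractors, which the test writer would still
need to review by hand.

```python
from collections import Counter


def distractors_from_responses(responses, correct, n=3):
    """Pick the `n` most common wrong answers given for one blank.

    `responses` are the words a pilot group wrote in an open-ended
    version of the test; `correct` is the keyed answer.
    """
    key = correct.strip().lower()
    cleaned = (r.strip().lower() for r in responses)
    wrong = Counter(r for r in cleaned if r and r != key)
    return [word for word, _count in wrong.most_common(n)]


if __name__ == "__main__":
    # Invented pilot responses for a single blank whose answer is "in".
    pilot = ["on", "in", "at", "in", "on", "in", "by", "at", "at", "at"]
    print(distractors_from_responses(pilot, "in", n=2))  # ['at', 'on']
```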

V. (c) iv Modified C-Test 

As with the other cloze tests, the first few sentences and the last
few sentences are left intact in the passage to give the examinees a
complete context. The c-test is a grammatically-based, modified cloze
test in which the second half of every second word is deleted
(excluding numbers and proper names). The c-test is very easy to
construct and, although open-ended, is easy to score because there is
usually only one acceptable answer for each question.

A FIRE ENGINE CREW 

There are usually five men in the crew of a fire engine. One of them
dri___ the en____. The leader h__ (5) usually be__ (6) in t__ (7)
Fire Ser____ (8) for ma__ (9) years. H_ (10) will kn__ (11) how t_ (12) fight
diff_____ (13) sorts o_ (14) fires. S_ (15), when t__ (16) firemen arr___ (17) at a
fire, it is always the leader who decides how to fight a fire. He tells each
fireman what to do.

(Klein-Braley and Raatz, 1984)
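
The deletion rule for a c-test can also be sketched in code. Several
details below are assumptions rather than part of the description
above: the function name c_test is hypothetical, proper names must be
supplied by the caller, one-letter words are skipped, skipped words do
not count toward "every second word", and for words with an odd
number of letters the larger half is deleted. The caller is expected
to pass in only the middle of the passage.

```python
import re


def c_test(body: str, proper_names=frozenset()):
    """Delete the second half of every second eligible word in `body`.

    Numbers, one-letter words, and anything listed in `proper_names`
    are left intact and not counted. Returns the mutilated text and
    the answer key of full words.
    """
    out, answers = [], []
    eligible_seen = 0
    for token in body.split():
        # Separate leading/trailing punctuation from the word itself.
        lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", token).groups()
        eligible = (
            len(core) > 1
            and not any(ch.isdigit() for ch in core)
            and core not in proper_names
        )
        if eligible:
            eligible_seen += 1
            if eligible_seen % 2 == 0:
                keep = len(core) // 2  # odd lengths lose the larger half
                answers.append(core)
                out.append(f"{lead}{core[:keep]}{'_' * (len(core) - keep)}{trail}")
                continue
        out.append(token)
    return " ".join(out), answers


if __name__ == "__main__":
    text, key = c_test("One of them drives the engine.")
    print(text)  # One o_ them dri___ the eng___.
    print(key)   # ['of', 'drives', 'engine']
```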

To improve on its reliability over other cloze tests, a c-test usually
includes about six different short passages with about 100
deletions altogether (Ikeguchi, 1995). Narrative and explanatory texts
tend to be more accurate than passages of argument and description






